Seed-Set Construction by Equi-entropy Partitioning for Efficient and Sensitive Short-Read Mapping

Authors

  • Kouichi Kimura
  • Asako Koike
  • Kenta Nakai
Abstract

Spaced seeds have been shown to be superior to continuous seeds for efficient and sensitive homology search based on the seed-and-extend paradigm. Much the same is true in genome mapping of high-throughput short-read data. However, a highly sensitive search with multiple spaced patterns often requires a large amount of index data. We propose a novel seed-set construction method for efficient and sensitive genome mapping of short reads with relatively high error rates, which uses only continuous seeds of variable length that allow a few errors. The seed lengths and allowable error positions are optimized on the basis of entropy, which is a measure of the ambiguity or repetitiveness of mapping positions. These seeds can be searched efficiently using the Burrows-Wheeler transform of the reference genome. Evaluation using actual biological SOLiD sequence data demonstrated that our method was competitive in speed and sensitivity while using much less memory and disk space than spaced-seed methods.
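
As a rough illustration of how a continuous seed can be searched with the Burrows-Wheeler transform, the following is a minimal sketch of exact-match backward search. It is not the authors' implementation: it ignores the error-tolerant matching and the entropy-based choice of seed lengths described in the abstract, it builds the index naively by sorting suffixes, and the function names `bwt_index` and `count_occurrences` are hypothetical. The occurrence count it returns gives one simple indication of how repetitive, and therefore how ambiguous, a seed is in the reference.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT index (C table and occurrence table) for `text`.

    Naive construction by sorting all suffixes; suitable only for short
    example strings, not for a real genome.
    """
    text += "$"  # sentinel, assumed smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    # C[c]: number of characters in text that are strictly smaller than c
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count_occurrences(seed, index):
    """Count exact occurrences of `seed` by backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open interval of matching suffix-array rows
    for ch in reversed(seed):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt_index("ACGTACGTTACG")
print(count_occurrences("ACG", index))  # 3
```

Each step of the loop extends the match by one character to the left, so the cost of a lookup grows with the seed length rather than with the genome length. That property is what makes variable-length continuous seeds cheap to query against a single index.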

Similar resources

Large-Scale Privacy-Preserving Mapping of Human Genomic Sequences on Hybrid Clouds

An operation preceding most human DNA analyses is read mapping, which aligns millions of short sequences (called reads) to a reference genome. This step involves an enormous amount of computation (evaluating edit distances for millions upon billions of sequence pairs) and thus needs to be outsourced to low-cost commercial clouds. This asks for scalable techniques to protect sensitive DNA inform...


The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Read alignment is an ongoing challenge for the analysis of data from sequencing technologies. This article proposes an elegantly simple multi-seed strategy, called seed-and-vote, for mapping reads to a reference genome. The new strategy chooses the mapped genomic location for the read directly from the seeds. It uses a relatively large number of short seeds (called subreads) extracted from each...


Lower semicontinuity for parametric set-valued vector equilibrium-like problems

A concept of weak $f$-property for a set-valued mapping is introduced, and then, under some suitable assumptions that do not involve any information about the solution set, the lower semicontinuity of the solution mapping to the parametric set-valued vector equilibrium-like problems is derived by using a density result and a scalarization method, where the constraint set $K$...


Entropy and MDL discretization of continuous variables for Bayesian belief networks

An efficient algorithm for partitioning the range of a continuous variable to a discrete number of intervals, for use in the construction of Bayesian belief networks (BBNs), is presented here. The partitioning minimizes the information loss, relative to the number of intervals used to represent the variable. Partitioning can be done prior to BBN construction or extended for repartitioning du...


Recent Advances in Complexity Theory

We give new constructions of randomness extractors and lossless condensers that are optimal to within constant factors in both the seed length and the output length. For extractors, this matches the parameters of the current best known construction [LRV03]; for lossless condensers, the previous best constructions achieved optimality to within a constant factor in one parameter only at ...




Publication date: 2011